The read_delim() function with the argument delim = "|".
read_delim(file, delim = "|")
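As a quick check, a minimal sketch using literal data (readr accepts a string containing newlines in place of a file path):

```r
library(readr)

# Literal data stands in for a file; the first line supplies column names.
df <- read_delim("x|y\n1|2", delim = "|")
df
```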
file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
read_csv() and read_tsv() have all of the same arguments as they are both special cases of read_delim(). These arguments include col_names, col_types, locale, na, quoted_na, quote, comment, trim_ws, skip, n_max, guess_max, and progress.
read_fwf()?
The most important argument to read_fwf() is col_positions (defines where the data columns begin and end).
" or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?"x,y\n1,'a,b'"
read_delim("x,y\n1,'a,b'", delim = ",", quote = "'")
## # A tibble: 1 x 2
## x y
## <int> <chr>
## 1 1 a,b
read_csv("a,b\n1,2,3\n4,5,6")
## Warning: 2 parsing failures.
## # A tibble: 2 x 5
##     row col   expected  actual    file
##   <int> <chr> <chr>     <chr>     <chr>
## 1     1 <NA>  2 columns 3 columns literal data
## 2     2 <NA>  2 columns 3 columns literal data
## # A tibble: 2 x 2
## a b
## <int> <int>
## 1 1 2
## 2 4 5
Only 2 column headers are given (a and b), but there are 3 columns in the rows. When you run the code, the values in the third column are dropped.
read_csv("a,b,c\n1,2\n1,2,3,4")
## Warning: 2 parsing failures.
## # A tibble: 2 x 5
##     row col   expected  actual    file
##   <int> <chr> <chr>     <chr>     <chr>
## 1     1 <NA>  3 columns 2 columns literal data
## 2     2 <NA>  3 columns 4 columns literal data
## # A tibble: 2 x 3
## a b c
## <int> <int> <int>
## 1 1 2 NA
## 2 1 2 3
3 column headers are given (a, b, and c), but there are only 2 columns in the first row and there are 4 columns in the last row. When you run the code, there is a missing (NA) value in the third column of the first row and the value in the fourth column of the last row is dropped.
read_csv("a,b\n\"1")
## Warning: 2 parsing failures.
## # A tibble: 2 x 5
##     row col   expected                     actual    file
##   <int> <chr> <chr>                        <chr>     <chr>
## 1     1 a     closing quote at end of file ""        literal data
## 2     1 <NA>  2 columns                    1 columns literal data
## # A tibble: 1 x 2
## a b
## <int> <chr>
## 1 1 <NA>
There is no closing quote after the 1, so the opening quote is dropped with a warning; a is parsed as an integer with value 1, and b is filled with a missing (NA) value.
read_csv("a,b\n1,2\na,b")
## # A tibble: 2 x 2
## a b
## <chr> <chr>
## 1 1 2
## 2 a b
When you run the code, a and b are both treated as character vectors since the second row contains non-numeric strings.
read_csv("a;b\n1;3")
## # A tibble: 1 x 1
## `a;b`
## <chr>
## 1 1;3
When you run the code, a;b is treated as a single column header and 1;3 as a single value within that column, because read_csv() looks for commas rather than semi-colons.
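readr ships read_csv2() for exactly this case; a minimal sketch on the same literal data:

```r
library(readr)

# read_csv2() treats ";" as the delimiter (and "," as the decimal mark),
# so the same text parses into two columns.
df2 <- read_csv2("a;b\n1;3")
df2
```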
locale()?
encoding is the most important argument to locale() since it affects how files are read. decimal_mark and grouping_mark are also important, as they define the symbols used to mark the decimal place and to group digits in large numbers.
decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?
locale(decimal_mark = ".", grouping_mark = ".")
## Error: `decimal_mark` and `grouping_mark` must be different
If decimal_mark and grouping_mark are set to the same character, then you get an error.
locale(decimal_mark = ",")
## <locale>
## Numbers: 123.456,78
## Formats: %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
## Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
## (May), June (Jun), July (Jul), August (Aug), September
## (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM: AM/PM
If decimal_mark is set to “,” then grouping_mark is set to “.”.
locale(grouping_mark = ".")
## <locale>
## Numbers: 123.456,78
## Formats: %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
## Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
## (May), June (Jun), July (Jul), August (Aug), September
## (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM: AM/PM
If grouping_mark is set to “.” then decimal_mark is set to “,”.
date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.
date_format and time_format define the default date and time formats, which are used by functions such as parse_date() and parse_time().
Examples:
parse_date("01/02/15", locale = locale(date_format = "%d/%m/%y"))
## [1] "2015-02-01"
parse_time("02-00-08 am", locale = locale(time_format = "%M-%S-%I %p"))
## 08:02:00
Not applicable.
read_csv() and read_csv2()?
read_csv() uses a comma as the delimiter, whereas read_csv2() uses a semi-colon as the delimiter.
Europe: ISO 8859-1, ISO 8859-2, ISO 8859-3, etc…
Asia: Big5, GB 2312, EUC-KR, EUC-JP, etc…
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
parse_date(d1, "%B %d, %Y")
## [1] "2010-01-01"
parse_date(d2, "%Y-%b-%d")
## [1] "2015-03-07"
parse_date(d3, "%d-%b-%Y")
## [1] "2017-06-06"
parse_date(d4, "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
parse_date(d5, "%m/%d/%y")
## [1] "2014-12-30"
parse_time(t1, "%H%M")
## 17:05:00
parse_time(t2, "%H:%M:%OS %p")
## 23:15:10.12
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
In table1, each variable (country, year, cases, population) has its own column and each observation has its own row.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
In table2, each row is defined by country, year, type (“cases” or “population”), and count.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
In table3, each row is defined by country, year, and rate (where rate is the number of cases divided by the population count as a character string).
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
Table 4 is spread across two tibbles, table4a for cases and table4b for population. Within each table, each row is defined by country and the number of cases or the population count for the year 1999 and the year 2000.
rate for table2, and table4a + table4b. You will need to perform four operations: 1) Extract the number of TB cases per country per year. 2) Extract the matching population per country per year. 3) Divide cases by population, and multiply by 10000. 4) Store back in the appropriate place. Which representation is easiest to work with? Which is hardest? Why?
For table2:
countries <- filter(table2, type == 'cases')$country
years <- filter(table2, type == 'cases')$year
count_cases <- filter(table2, type == 'cases')$count
count_population <- filter(table2, type == 'population')$count
rate <- count_cases / count_population * 10000
tibble(country = countries, year = years, rate = rate)
## # A tibble: 6 x 3
## country year rate
## <chr> <int> <dbl>
## 1 Afghanistan 1999 0.373
## 2 Afghanistan 2000 1.29
## 3 Brazil 1999 2.19
## 4 Brazil 2000 4.61
## 5 China 1999 1.67
## 6 China 2000 1.67
For table4a + table4b:
countries <- table4a$country
cases_1999 <- table4a$`1999`
cases_2000 <- table4a$`2000`
population_1999 <- table4b$`1999`
population_2000 <- table4b$`2000`
rate_1999 <- tibble(country = countries, year = 1999, rate = cases_1999 / population_1999 * 10000)
rate_2000 <- tibble(country = countries, year = 2000, rate = cases_2000 / population_2000 * 10000)
(table4_rate <- rbind(rate_1999, rate_2000) %>% arrange(country))
## # A tibble: 6 x 3
## country year rate
## <chr> <dbl> <dbl>
## 1 Afghanistan 1999. 0.373
## 2 Afghanistan 2000. 1.29
## 3 Brazil 1999. 2.19
## 4 Brazil 2000. 4.61
## 5 China 1999. 1.67
## 6 China 2000. 1.67
table2 is easier to work with since all of the data is in one table.
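A tidyr-based alternative for table4a + table4b, sketched here for comparison; it avoids the manual column extraction by reshaping both tables to long form and joining:

```r
library(dplyr)
library(tidyr)

# Reshape each table to long form, join on country and year, then compute rate.
tb_cases <- gather(table4a, year, cases, `1999`:`2000`)
tb_pop   <- gather(table4b, year, population, `1999`:`2000`)

table4_tidy <- tb_cases %>%
  left_join(tb_pop, by = c("country", "year")) %>%
  mutate(rate = cases / population * 10000)
table4_tidy
```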
table2 instead of table1. What do you need to do first?
You first need to filter table2 to keep only the rows with type == "cases".
table2_cases <- filter(table2, type == "cases")
ggplot(data = table2_cases, aes(year, count)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(color = country))
gather() and spread() not perfectly symmetrical? Carefully consider the following example:
(stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
))
## # A tibble: 4 x 3
## year half return
## <dbl> <dbl> <dbl>
## 1 2015. 1. 1.88
## 2 2015. 2. 0.590
## 3 2016. 1. 0.920
## 4 2016. 2. 0.170
stocks %>%
spread(year, return) %>%
gather("year", "return", `2015`:`2016`)
## # A tibble: 4 x 3
## half year return
## <dbl> <chr> <dbl>
## 1 1. 2015 1.88
## 2 2. 2015 0.590
## 3 1. 2016 0.920
## 4 2. 2016 0.170
spread() and gather() have a convert argument. What does it do?
spread() and gather() are not perfectly symmetrical because information about column types is not transferred between them. In the example above, after running spread() and then gather(), the year column changes from type dbl to type chr: spread() turns 2015 and 2016 into column names, which gather() then treats as strings. Setting convert = TRUE runs type.convert() on the key column, which would turn year back into a numeric type.
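A sketch demonstrating the convert argument on the stocks example (the tibble is recreated here so the block is self-contained):

```r
library(dplyr)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(   1,    2,    1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

# convert = TRUE runs type.convert() on the key column, so the year values
# come back numeric instead of character.
tidied <- stocks %>%
  spread(year, return) %>%
  gather("year", "return", `2015`:`2016`, convert = TRUE)
tidied
```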
table4a %>%
gather(1999, 2000, key = "year", value = "cases")
## Error in inds_combine(.vars, ind_list): Position must be between 0 and n
This code fails because 1999 and 2000 need to be in quotes (since they refer to names of columns in table4a rather than the 1999th and 2000th columns of table4a).
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
people <- tribble(
~name, ~key, ~value,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
spread(people, key, value)
## Error: Duplicate identifiers for rows (1, 3)
Spreading this tibble fails because both the first and third rows contain a different value for “age” of “Phillip Woods”. A new column with the number of the observation could be added to fix the problem, as shown below.
people <- tribble(
~name, ~key, ~value, ~obs_num,
#-----------------|--------|------|------
"Phillip Woods", "age", 45, 1,
"Phillip Woods", "height", 186, 1,
"Phillip Woods", "age", 50, 2,
"Jessica Cordero", "age", 37, 1,
"Jessica Cordero", "height", 156, 1
)
spread(people, key, value)
## # A tibble: 3 x 4
## name obs_num age height
## <chr> <dbl> <dbl> <dbl>
## 1 Jessica Cordero 1. 37. 156.
## 2 Phillip Woods 1. 45. 186.
## 3 Phillip Woods 2. 50. NA
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
The tibble needs to be gathered. The variables are sex (male or female), pregnant (yes or no), and count.
preg %>%
gather(male, female, key = "sex", value = "count", na.rm = TRUE)
## # A tibble: 3 x 3
## pregnant sex count
## * <chr> <chr> <dbl>
## 1 no male 20.
## 2 yes female 10.
## 3 no female 12.
extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
The extra argument in separate() controls what happens when there are too many pieces. The default behavior is to emit a warning and drop the extra values ("warn"). The alternatives are to drop any extra values without a warning ("drop") or to merge the extra pieces into the final column ("merge").
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "drop")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "merge")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f,g
## 3 h i j
The fill argument in separate() controls what happens when there are not enough pieces. The default behavior is to emit a warning and fill from the right ("warn"). The alternatives are to fill with missing values on the right without a warning ("right") or to fill with missing values on the left ("left").
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "right")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "left")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 <NA> d e
## 3 f g i
unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?
If set to TRUE, the remove argument in unite() and separate() removes the input column from the output data frame. You would set it to FALSE if you want to create a new variable while keeping the old one.
table1 %>%
separate(year, into = c("century", "year_2"), sep = 2, remove = TRUE)
## # A tibble: 6 x 5
## country century year_2 cases population
## <chr> <chr> <chr> <int> <int>
## 1 Afghanistan 19 99 745 19987071
## 2 Afghanistan 20 00 2666 20595360
## 3 Brazil 19 99 37737 172006362
## 4 Brazil 20 00 80488 174504898
## 5 China 19 99 212258 1272915272
## 6 China 20 00 213766 1280428583
table1 %>%
separate(year, into = c("century", "year_2"), sep = 2, remove = FALSE)
## # A tibble: 6 x 6
## country year century year_2 cases population
## <chr> <int> <chr> <chr> <int> <int>
## 1 Afghanistan 1999 19 99 745 19987071
## 2 Afghanistan 2000 20 00 2666 20595360
## 3 Brazil 1999 19 99 37737 172006362
## 4 Brazil 2000 20 00 80488 174504898
## 5 China 1999 19 99 212258 1272915272
## 6 China 2000 20 00 213766 1280428583
With remove = TRUE, year is removed from the data frame; with remove = FALSE, year remains in the data frame.
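The same flag works for unite(); a sketch using tidyr's built-in table5, where remove = FALSE keeps century and year alongside the combined column (the name new_year is arbitrary):

```r
library(tidyr)

# unite() pastes century and year together; remove = FALSE keeps the inputs.
res <- unite(table5, new_year, century, year, sep = "", remove = FALSE)
res
```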
separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
extract() uses a regular expression with capture groups and turns each group into a new column. separate() uses either a regular expression or a vector of character positions to split a single character column into multiple columns. While there are several ways to separate a column into multiple columns, there is only one way to put multiple columns together into a single column, so it makes sense that there are three variations of separation but only one unite.
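For example, extract() can split table3's rate column using capture groups:

```r
library(tidyr)

# extract() captures each parenthesised group into its own column;
# convert = TRUE then parses the pieces as integers.
res3 <- extract(table3, rate, into = c("cases", "population"),
                regex = "(\\d+)/(\\d+)", convert = TRUE)
res3
```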
fill arguments to spread() and complete().
If fill is set in spread(), all missing values, both explicit (NA) and implicit (rows that aren't present), are replaced by the fill value. In complete(), the fill argument takes a list specifying a replacement value per variable, so NAs in different variables can be replaced by different values, whereas the fill argument in spread() takes only a single value.
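A small sketch of complete() with a per-variable fill value (toy data):

```r
library(tidyr)

df <- tibble::tibble(
  group = c("a", "b"),
  item  = c(1, 2),
  value = c(10, 20)
)

# complete() adds the missing group/item combinations as new rows;
# fill supplies a per-variable replacement for the resulting NAs.
completed <- complete(df, group, item, fill = list(value = 0))
completed
```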
direction argument to fill() do?
The direction argument to fill() specifies the direction in which to fill missing values. The default is “down” (any NAs will be replaced by the previous non-missing value) but it can also be set to “up”.
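A sketch of both directions on a toy column (note the argument is spelled .direction in tidyr):

```r
library(tidyr)

df <- tibble::tibble(x = c(1, NA, NA, 4))

filled_down <- fill(df, x)                     # NAs take the previous value
filled_up   <- fill(df, x, .direction = "up")  # NAs take the next value
```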
na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?
who %>%
group_by(country) %>%
summarize(year_min = min(year), year_max = max(year)) %>%
ggplot() +
geom_point(mapping = aes(x = year_min, y = country), color = 'red') +
geom_point(mapping = aes(x = year_max , y= country), color = 'blue')
While most countries have data starting in 1980 and ending in 2013, there are implicit missing values, as shown by the plot above.
sum(who %>% select(-c(1:4)) == 0, na.rm = TRUE)
## [1] 11080
sum(who %>% select(-c(1:4)) %>% sapply(function(x){sum(is.na(x))}))
## [1] 329394
There are 11080 zero values (which indicate that no cases of TB were reported) and 329394 NA values (which likely indicate that no data was collected). Since an NA and a zero carry different meanings, setting na.rm = TRUE seems reasonable in this case.
mutate() step? (mutate(key = stringr::str_replace(key, "newrel", "new_rel")))
who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>%
#mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "type", "sexage"), sep = "_") %>%
tail()
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows
## [73467, 73468, 73469, 73470, 73471, 73472, 73473, 73474, 73475, 73476,
## 73477, 73478, 73479, 73480, 73481, 73482, 73483, 73484, 73485, 73486, ...].
## # A tibble: 6 x 8
## country iso2 iso3 year new type sexage cases
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Venezuela (Bolivarian Republ… VE VEN 2013 newr… f65 <NA> 402
## 2 Viet Nam VN VNM 2013 newr… f65 <NA> 3110
## 3 Wallis and Futuna Islands WF WLF 2013 newr… f65 <NA> 2
## 4 Yemen YE YEM 2013 newr… f65 <NA> 360
## 5 Zambia ZM ZMB 2013 newr… f65 <NA> 669
## 6 Zimbabwe ZW ZWE 2013 newr… f65 <NA> 725
If you neglect the mutate() step, keys beginning with “newrel” contain only two underscore-separated pieces, so separate() puts the sex-age code in the type column and fills sexage with NA, as in the rows shown above.
iso2 and iso3 were redundant with country. Confirm this claim.
select(who, country, iso2, iso3) %>%
distinct() %>%
group_by(country) %>%
filter(n() > 1)
## # A tibble: 0 x 3
## # Groups: country [0]
## # ... with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>
who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "type", "sexage"), sep = "_") %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1) %>%
group_by(country, year, sex) %>%
summarise(cases = sum(cases)) %>%
unite(country_sex, country, sex, remove = FALSE) %>%
ggplot(aes(x = year, y = cases, group = country_sex, colour = sex)) +
geom_line()
You would need to combine the flights and airports tables, as you would need to match the dest and origin variables from flights with the faa variable from airports.
weather and airports. What is the relationship and how should it appear in the diagram?
weather and airports can be matched using the variable origin in weather and the variable faa in airports.
weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?
If it contained weather records for all airports in the USA, you could also connect weather with flights through dest.
You could create a table with columns for year, month, day, and holiday_name. The primary key would be the unique date (year, month, day combination) of the holiday, and you could connect the table to existing tables (for example, flights and weather) based on the date.
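The holiday table described above might be sketched like this (the holiday names and dates are illustrative, not an official calendar):

```r
library(tibble)

# A hypothetical holidays table; its year/month/day columns mirror those
# in flights and weather, so the date can serve as a join key.
holidays <- tribble(
  ~year, ~month, ~day, ~holiday_name,
  2013,      1,    1,  "New Year's Day",
  2013,      7,    4,  "Independence Day",
  2013,     12,   25,  "Christmas Day"
)

# Sketch of the join (requires nycflights13):
# flights %>% left_join(holidays, by = c("year", "month", "day"))
```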
flights %>%
mutate(flight_id = row_number(year)) %>%
select(flight_id, everything())
## # A tibble: 336,776 x 20
## flight_id year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <int> <dbl> <int>
## 1 1 2013 1 1 517 515 2. 830
## 2 2 2013 1 1 533 529 4. 850
## 3 3 2013 1 1 542 540 2. 923
## 4 4 2013 1 1 544 545 -1. 1004
## 5 5 2013 1 1 554 600 -6. 812
## 6 6 2013 1 1 554 558 -4. 740
## 7 7 2013 1 1 555 600 -5. 913
## 8 8 2013 1 1 557 600 -3. 709
## 9 9 2013 1 1 557 600 -3. 838
## 10 10 2013 1 1 558 600 -2. 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
#Lahman::Batting %>% head()
Lahman::Batting %>%
group_by(playerID, yearID, stint) %>%
mutate(n = n()) %>%
filter(n > 1) %>%
nrow()
## [1] 0
The primary key for Lahman::Batting is playerID, yearID, stint.
#babynames::babynames %>% head()
babynames::babynames %>%
group_by(year, sex, name) %>%
mutate(n = n()) %>%
filter(n > 1) %>%
nrow()
## [1] 0
The primary key for babynames::babynames is year, sex, name.
#nasaweather::atmos %>% head()
nasaweather::atmos %>%
group_by(lat, long, year, month) %>%
mutate(n = n()) %>%
filter(n > 1) %>%
nrow()
## [1] 0
The primary key for nasaweather::atmos is lat, long, year, and month.
#fueleconomy::vehicles %>% head()
fueleconomy::vehicles %>%
group_by(id) %>%
mutate(n = n()) %>%
filter(n > 1) %>%
nrow()
## [1] 0
The primary key for fueleconomy::vehicles is id.
#ggplot2::diamonds %>% head()
ggplot2::diamonds %>%
distinct() %>%
nrow()
## [1] 53794
nrow(ggplot2::diamonds)
## [1] 53940
There is no primary key for ggplot2::diamonds. The number of distinct rows in the dataset is less than the total number of rows, suggesting that no combination of variables uniquely identifies the observations.
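The duplicate check repeated above can be wrapped in a small helper; this base-R sketch (the function name is hypothetical) returns TRUE when the given columns uniquely identify every row:

```r
# anyDuplicated() returns 0 when no two rows share the same values
# in the selected columns, i.e. the columns form a candidate key.
is_candidate_key <- function(df, cols) {
  anyDuplicated(df[cols]) == 0
}

toy <- data.frame(a = c(1, 1, 2), b = c(1, 2, 2))
is_candidate_key(toy, c("a", "b"))  # TRUE
is_candidate_key(toy, "a")          # FALSE
```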
Batting, Master, and Salaries tables in the Lahman package. Draw another diagram that shows the relationship between Master, Managers, AwardsManagers. How would you characterise the relationship between the Batting, Pitching, and Fielding tables?
library(datamodelr)
dm1 <- dm_from_data_frames(list(Batting = Lahman::Batting,
Master = Lahman::Master,
Salaries = Lahman::Salaries)) %>%
dm_set_key("Batting", c("playerID", "yearID", "stint")) %>%
dm_set_key("Master", "playerID") %>%
dm_set_key("Salaries", c("yearID", "teamID", "playerID")) %>%
dm_add_references(
Batting$playerID == Master$playerID,
Salaries$playerID == Master$playerID
)
graph1 <- dm_create_graph(dm1, rankdir = "LR", columnArrows = TRUE)
dm_render_graph(graph1)
library(datamodelr)
dm2 <- dm_from_data_frames(list(Master = Lahman::Master,
Managers = Lahman::Managers,
AwardsManagers = Lahman::AwardsManagers)) %>%
dm_set_key("Master", "playerID") %>%
dm_set_key("Managers", c("yearID", "teamID", "inseason")) %>%
dm_set_key("AwardsManagers", c("playerID", "awardID", "yearID")) %>%
dm_add_references(
Managers$playerID == Master$playerID,
AwardsManagers$playerID == Master$playerID
)
graph2 <- dm_create_graph(dm2, rankdir = "LR", columnArrows = TRUE)
dm_render_graph(graph2)
Batting and Master can be matched by playerID. Salaries can then be matched with Batting by playerID and yearID. Master and Managers can be matched by playerID. Managers and AwardsManagers can be matched by playerID and yearID.
Batting, Pitching, and Fielding all have a primary key of playerID, yearID, and stint, and have a one-to-one relationship to each other.
airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:
airports %>%
semi_join(flights, c("faa" = "dest")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point() +
coord_quickmap()
flights %>%
group_by(dest) %>%
summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
left_join(airports, by = c('dest' = 'faa')) %>%
ggplot(aes(x = lon, y = lat, size = avg_arr_delay, color = avg_arr_delay)) +
borders('state') +
geom_point() +
coord_quickmap()
## Warning: Removed 5 rows containing missing values (geom_point).
lat and lon) to flights.
flights %>%
left_join(airports, by = c('dest' = 'faa')) %>%
left_join(airports, by = c('origin' = 'faa'), suffix = c('.dest', '.origin')) %>%
select(dest, origin, lat.dest, lon.dest, lat.origin, lon.origin)
## # A tibble: 336,776 x 6
## dest origin lat.dest lon.dest lat.origin lon.origin
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 IAH EWR 30.0 -95.3 40.7 -74.2
## 2 IAH LGA 30.0 -95.3 40.8 -73.9
## 3 MIA JFK 25.8 -80.3 40.6 -73.8
## 4 BQN JFK NA NA 40.6 -73.8
## 5 ATL LGA 33.6 -84.4 40.8 -73.9
## 6 ORD EWR 42.0 -87.9 40.7 -74.2
## 7 FLL EWR 26.1 -80.2 40.7 -74.2
## 8 IAD LGA 38.9 -77.5 40.8 -73.9
## 9 MCO JFK 28.4 -81.3 40.6 -73.8
## 10 ORD LGA 42.0 -87.9 40.8 -73.9
## # ... with 336,766 more rows
plane_age <- planes %>%
mutate(age = 2013 - year) %>%
select(tailnum, age) %>%
filter(!is.na(age))
flights %>%
inner_join(plane_age, by = "tailnum") %>%
group_by(age) %>%
filter(!is.na(dep_delay)) %>%
summarise(delay = mean(dep_delay)) %>%
ggplot(aes(x = age, y = delay)) +
geom_point() +
geom_line()
There is no clear relationship between the age of a plane and its delays. From the plot, delay seems to decrease slightly as the age of the plane increases, but this could be due to other factors (for example, older planes may fly different routes).
flight_weather <- flights %>%
inner_join(weather, by = c("origin" = "origin",
"year" = "year",
"month" = "month",
"day" = "day",
"hour" = "hour"))
flight_weather %>%
group_by(precip) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = precip, y = delay)) +
geom_line() +
geom_point()
flight_weather %>%
group_by(wind_speed) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = wind_speed, y = delay)) +
geom_line() +
geom_point()
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
flight_weather %>%
group_by(humid) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = humid, y = delay)) +
geom_line() +
geom_point()
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_point).
Increased wind speed and increased precipitation appear to make a delay more likely.
flights %>%
filter(year == 2013, month == 6, day == 13) %>%
group_by(dest) %>%
summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
left_join(airports, by = c('dest' = 'faa')) %>%
ggplot(aes(x = lon, y = lat, size = avg_arr_delay, color = avg_arr_delay)) +
borders('state') +
geom_point(alpha = .5) +
scale_color_continuous(low = 'yellow', high = 'red') +
coord_quickmap()
## Warning: Removed 7 rows containing missing values (geom_point).
There was a severe storm along the East Coast, resulting in increased delays in those areas.
tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)
flights %>%
anti_join(planes, by = "tailnum") %>%
count(carrier, sort = TRUE)
## # A tibble: 10 x 2
## carrier n
## <chr> <int>
## 1 MQ 25397
## 2 AA 22558
## 3 UA 1693
## 4 9E 1044
## 5 B6 830
## 6 US 699
## 7 FL 187
## 8 DL 110
## 9 F9 50
## 10 WN 38
Most of the flights whose tail numbers have no matching record in planes belong to the carriers MQ and AA; perhaps these carriers simply do not report tail numbers to the planes dataset.
flights_100 <- flights %>%
filter(!is.na(dep_delay)) %>%
group_by(tailnum) %>%
summarize(n = n()) %>%
filter(n > 100)
flights %>%
semi_join(flights_100, by = 'tailnum')
## # A tibble: 225,766 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 544 545 -1. 1004
## 4 2013 1 1 554 558 -4. 740
## 5 2013 1 1 555 600 -5. 913
## 6 2013 1 1 557 600 -3. 709
## 7 2013 1 1 557 600 -3. 838
## 8 2013 1 1 558 600 -2. 849
## 9 2013 1 1 558 600 -2. 853
## 10 2013 1 1 558 600 -2. 923
## # ... with 225,756 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.
fueleconomy::vehicles %>%
semi_join(fueleconomy::common, by = c('make', 'model'))
## # A tibble: 14,531 x 12
## id make model year class trans drive cyl displ fuel hwy cty
## <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl> <chr> <int> <int>
## 1 1833 Acura Inte… 1986 Subc… Auto… Fron… 4 1.60 Regu… 28 22
## 2 1834 Acura Inte… 1986 Subc… Manu… Fron… 4 1.60 Regu… 28 23
## 3 3037 Acura Inte… 1987 Subc… Auto… Fron… 4 1.60 Regu… 28 22
## 4 3038 Acura Inte… 1987 Subc… Manu… Fron… 4 1.60 Regu… 28 23
## 5 4183 Acura Inte… 1988 Subc… Auto… Fron… 4 1.60 Regu… 27 22
## 6 4184 Acura Inte… 1988 Subc… Manu… Fron… 4 1.60 Regu… 28 23
## 7 5303 Acura Inte… 1989 Subc… Auto… Fron… 4 1.60 Regu… 27 22
## 8 5304 Acura Inte… 1989 Subc… Manu… Fron… 4 1.60 Regu… 28 23
## 9 6442 Acura Inte… 1990 Subc… Auto… Fron… 4 1.80 Regu… 24 20
## 10 6443 Acura Inte… 1990 Subc… Manu… Fron… 4 1.80 Regu… 26 21
## # ... with 14,521 more rows
weather data. Can you see any patterns?
flights_48 <- flights %>%
group_by(year, month, day) %>%
summarise(delay_24h = sum(dep_delay, na.rm = TRUE) + sum(arr_delay, na.rm = TRUE)) %>%
mutate(delay_48h = delay_24h + lag(delay_24h)) %>%
filter(!(is.na(delay_48h))) %>%
arrange(desc(delay_48h))
weather_48 <- weather %>%
group_by(year, month, day) %>%
summarize_at(vars(humid, precip, temp, visib), mean, na.rm = TRUE)
flights_48 %>%
left_join(weather_48) %>%
head(10)
## Joining, by = c("year", "month", "day")
## # A tibble: 10 x 9
## # Groups: year, month [6]
## year month day delay_24h delay_48h humid precip temp visib
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013. 7. 23 80641. 175419. 74.6 0.0154 80.6 9.33
## 2 2013. 3. 8 135264. 167530. 80.4 0.0192 35.5 4.91
## 3 2013. 6. 25 80434. 166649. 60.3 0.00153 81.9 9.97
## 4 2013. 8. 9 72866. 165287. 82.8 0.00153 78.5 9.08
## 5 2013. 6. 28 81389. 157910. 73.1 0. 78.0 9.82
## 6 2013. 7. 10 97120. 157396. 73.2 0.00111 80.6 9.24
## 7 2013. 4. 19 82556. 150125. 86.1 0.000833 61.3 5.25
## 8 2013. 3. 9 3243. 138507. 46.4 0. 43.5 10.0
## 9 2013. 5. 24 51662. 137740. 84.1 0.0106 57.3 8.84
## 10 2013. 6. 14 39422. 136526. 70.6 0.0100 62.5 9.44
flights_48 %>%
left_join(weather_48) %>%
tail(10)
## Joining, by = c("year", "month", "day")
## # A tibble: 10 x 9
## # Groups: year, month [3]
## year month day delay_24h delay_48h humid precip temp visib
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013. 9. 10 -5264. -16039. 73.0 0. 76.3 9.79
## 2 2013. 8. 27 -8109. -16654. 66.3 0.000833 78.9 9.99
## 3 2013. 8. 26 -8545. -16884. 59.1 0.000417 75.5 10.0
## 4 2013. 9. 9 -10775. -19308. 55.0 0. 65.2 10.0
## 5 2013. 9. 18 -9508. -19428. 55.4 0. 61.2 10.0
## 6 2013. 9. 8 -8533. -22665. 51.2 0. 73.3 10.0
## 7 2013. 9. 5 -15387. -28961. 48.8 0. 71.9 10.0
## 8 2013. 9. 7 -14132. -31731. 50.5 0. 68.7 10.0
## 9 2013. 10. 2 -13623. -31919. 58.9 0. 73.3 9.97
## 10 2013. 9. 6 -17599. -32986. 45.5 0. 64.1 10.0
Precipitation and temperature appear to be slightly higher and visibility slightly lower for the worst delays compared to the smallest delays.
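To move beyond eyeballing the two tables, the pattern could be quantified with cor(). A minimal self-contained sketch with made-up delay and visibility values (toy vectors, not the real flights_48/weather_48 tables above):

```r
# toy data standing in for daily delay totals and mean visibility
delay <- c(10, 20, 30, NA, 50)
visib <- c(10, 9, 7, 5, 3)

# use = "complete.obs" drops pairs with missing values before correlating
cor(delay, visib, use = "complete.obs")
# strongly negative: lower visibility goes with larger delays
```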
anti_join(flights, airports, by = c("dest" = "faa")) tell you? What does anti_join(airports, flights, by = c("faa" = "dest")) tell you?
anti_join(flights, airports, by = c("dest" = "faa"))
## # A tibble: 7,602 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 544 545 -1. 1004
## 2 2013 1 1 615 615 0. 1039
## 3 2013 1 1 628 630 -2. 1137
## 4 2013 1 1 701 700 1. 1123
## 5 2013 1 1 711 715 -4. 1151
## 6 2013 1 1 820 820 0. 1254
## 7 2013 1 1 820 820 0. 1249
## 8 2013 1 1 840 845 -5. 1311
## 9 2013 1 1 909 810 59. 1331
## 10 2013 1 1 913 918 -5. 1346
## # ... with 7,592 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
This gives the flights whose destinations do not appear in the airports dataset.
anti_join(airports, flights, by = c("faa" = "dest"))
## # A tibble: 1,357 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5. A America/New_…
## 2 06A Moton Field Municip… 32.5 -85.7 264 -6. A America/Chic…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6. A America/Chic…
## 4 06N Randall Airport 41.4 -74.4 523 -5. A America/New_…
## 5 09J Jekyll Island Airpo… 31.1 -81.4 11 -5. A America/New_…
## 6 0A9 Elizabethton Munici… 36.4 -82.2 1593 -5. A America/New_…
## 7 0G6 Williams County Air… 41.5 -84.5 730 -5. A America/New_…
## 8 0G7 Finger Lakes Region… 42.9 -76.8 492 -5. A America/New_…
## 9 0P2 Shoestring Aviation… 39.8 -76.6 1000 -5. U America/New_…
## 10 0S9 Jefferson County In… 48.1 -123. 108 -8. A America/Los_…
## # ... with 1,347 more rows
This gives the airports that no flight in the flights dataset flies to, i.e. airports that never appear as a destination.
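The semantics of the two anti-joins can be sketched in base R with toy tables (hypothetical data, not the real nycflights13 tables):

```r
flights_toy  <- data.frame(dest = c("JFK", "LAX", "ABC"))
airports_toy <- data.frame(faa  = c("JFK", "LAX", "ORD"))

# rows of flights_toy with no match in airports_toy
# (same idea as anti_join(flights, airports, by = c("dest" = "faa")))
flights_toy[!flights_toy$dest %in% airports_toy$faa, , drop = FALSE]

# rows of airports_toy never used as a destination
# (same idea as anti_join(airports, flights, by = c("faa" = "dest")))
airports_toy[!airports_toy$faa %in% flights_toy$dest, , drop = FALSE]
```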
flights %>%
select(carrier, tailnum) %>%
group_by(tailnum) %>%
summarize(n = length(unique(carrier))) %>%
filter(n > 1)
## # A tibble: 18 x 2
## tailnum n
## <chr> <int>
## 1 N146PQ 2
## 2 N153PQ 2
## 3 N176PQ 2
## 4 N181PQ 2
## 5 N197PQ 2
## 6 N200PQ 2
## 7 N228PQ 2
## 8 N232PQ 2
## 9 N933AT 2
## 10 N935AT 2
## 11 N977AT 2
## 12 N978AT 2
## 13 N979AT 2
## 14 N981AT 2
## 15 N989AT 2
## 16 N990AT 2
## 17 N994AT 2
## 18 <NA> 7
Hypothesis rejected: some planes are flown by more than one airline. (Note that the <NA> row is not a real plane; tailnum is missing for those flights, so 17 actual tail numbers are shared between carriers.)
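Since <NA> should not count as a plane, the per-tailnum check is cleaner if missing tail numbers are dropped first. A base R toy sketch of the same idea (hypothetical data, not the real flights table):

```r
df <- data.frame(
  carrier = c("AA", "DL", "DL", "UA", "UA"),
  tailnum = c("N1", "N1", "N2", "N2", NA)
)
ok <- !is.na(df$tailnum)

# number of distinct carriers per tail number, ignoring missing tailnum
shared <- tapply(df$carrier[ok], df$tailnum[ok], function(x) length(unique(x)))
shared[shared > 1]
```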
rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?
ggplot(gss_cat, aes(rincome)) +
geom_bar()
The default bar chart is hard to understand because the labels on the x axis overlap and are unreadable. This can be improved by flipping the x and y axes with coord_flip().
ggplot(gss_cat, aes(rincome)) +
geom_bar() +
coord_flip()
relig in this survey? What’s the most common partyid?
gss_cat %>%
count(relig) %>%
arrange(desc(n)) %>%
head(1)
## # A tibble: 1 x 2
## relig n
## <fct> <int>
## 1 Protestant 10846
The most common relig in the survey is Protestant.
gss_cat %>%
count(partyid) %>%
arrange(desc(n)) %>%
head(1)
## # A tibble: 1 x 2
## partyid n
## <fct> <int>
## 1 Independent 4119
The most common partyid in the survey is Independent.
relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?
denom only applies when relig is Protestant.
With a table:
gss_cat %>%
count(relig, denom) %>%
filter(!denom %in% c("No answer", "Other", "Don't know", "Not applicable", "No denomination"))
## # A tibble: 25 x 3
## relig denom n
## <fct> <fct> <int>
## 1 Protestant Episcopal 397
## 2 Protestant Presbyterian-dk wh 244
## 3 Protestant Presbyterian, merged 67
## 4 Protestant Other presbyterian 47
## 5 Protestant United pres ch in us 110
## 6 Protestant Presbyterian c in us 104
## 7 Protestant Lutheran-dk which 267
## 8 Protestant Evangelical luth 122
## 9 Protestant Other lutheran 30
## 10 Protestant Wi evan luth synod 71
## # ... with 15 more rows
With a visualisation:
gss_cat %>%
count(relig, denom) %>%
ggplot(aes(relig, denom)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90))
tvhours. Is the mean a good summary?
summary(gss_cat[["tvhours"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 2.000 2.981 4.000 24.000 10146
No. The distribution is right-skewed (the mean, 2.98, sits well above the median, 2), so the median would provide a better summary. Note also the large number of missing values (10,146 NAs).
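A tiny illustration of why the mean is a poor summary for a right-skewed variable like tvhours (made-up values):

```r
# nine typical viewers plus one heavy viewer
x <- c(rep(2, 9), 24)
mean(x)    # pulled up by the single outlier: 4.2
median(x)  # still reflects the typical value: 2
```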
gss_cat identify whether the order of the levels is arbitrary or principled.
fct_gss <- gss_cat[sapply(gss_cat, is.factor)]
lapply(fct_gss, levels)
## $marital
## [1] "No answer" "Never married" "Separated" "Divorced"
## [5] "Widowed" "Married"
##
## $race
## [1] "Other" "Black" "White" "Not applicable"
##
## $rincome
## [1] "No answer" "Don't know" "Refused" "$25000 or more"
## [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
## [9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
## [13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
##
## $partyid
## [1] "No answer" "Don't know" "Other party"
## [4] "Strong republican" "Not str republican" "Ind,near rep"
## [7] "Independent" "Ind,near dem" "Not str democrat"
## [10] "Strong democrat"
##
## $relig
## [1] "No answer" "Don't know"
## [3] "Inter-nondenominational" "Native american"
## [5] "Christian" "Orthodox-christian"
## [7] "Moslem/islam" "Other eastern"
## [9] "Hinduism" "Buddhism"
## [11] "Other" "None"
## [13] "Jewish" "Catholic"
## [15] "Protestant" "Not applicable"
##
## $denom
## [1] "No answer" "Don't know" "No denomination"
## [4] "Other" "Episcopal" "Presbyterian-dk wh"
## [7] "Presbyterian, merged" "Other presbyterian" "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which" "Evangelical luth"
## [13] "Other lutheran" "Wi evan luth synod" "Lutheran-mo synod"
## [16] "Luth ch in america" "Am lutheran" "Methodist-dk which"
## [19] "Other methodist" "United methodist" "Afr meth ep zion"
## [22] "Afr meth episcopal" "Baptist-dk which" "Other baptists"
## [25] "Southern baptist" "Nat bapt conv usa" "Nat bapt conv of am"
## [28] "Am bapt ch in usa" "Am baptist asso" "Not applicable"
For marital, race, partyid, relig, and denom, the order of the levels is arbitrary. For rincome, the order of the levels is principled.
Moving “Not applicable” to the front of the levels gives it an integer value of 1, and ggplot2 places level 1 at the bottom of the axis.
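The mechanism can be checked with a small base R example: a factor's levels map to the integers 1, 2, …, so whatever level is moved to the front gets code 1 (toy factor, not the gss_cat data):

```r
f <- factor(c("a", "b", "Not applicable"),
            levels = c("Not applicable", "a", "b"))
as.integer(f)  # "Not applicable" maps to 1, so it is drawn at the bottom
```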
levels(gss_cat$partyid)
## [1] "No answer" "Don't know" "Other party"
## [4] "Strong republican" "Not str republican" "Ind,near rep"
## [7] "Independent" "Ind,near dem" "Not str democrat"
## [10] "Strong democrat"
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
Democrat = c("Not str democrat", "Strong democrat"),
Republican = c("Strong republican", "Not str republican"),
Independent = c("Ind,near rep", "Independent", "Ind,near dem"),
Other = c("No answer", "Don't know", "Other party"))) %>%
count(year, partyid) %>%
group_by(year) %>%
mutate(p = n / sum(n)) %>%
ggplot(aes(year, p, group = partyid, color = partyid)) +
geom_point() +
geom_line()
The proportion of Independents is increasing, the proportion of Democrats has fluctuated but overall remains roughly steady, and the proportion of Republicans is decreasing.
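The fct_collapse() step above can also be sketched in base R by assigning a named list of old levels to levels() (toy factor, not the full partyid variable):

```r
f <- factor(c("Strong democrat", "Not str democrat", "Independent"))

# base R equivalent of fct_collapse(): a named list merges old levels
levels(f) <- list(
  Democrat    = c("Strong democrat", "Not str democrat"),
  Independent = "Independent"
)
table(f)
# Democrat: 2, Independent: 1
```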
rincome into a small set of categories?
gss_cat %>%
mutate(rincome = fct_collapse(rincome, `Unknown` = c("No answer", "Don't know", "Refused", "Not applicable"),
`Lt $5000` = c("Lt $1000", str_c("$", c("1000", "3000", "4000"), " to ", c("2999", "3999", "4999"))),
`$5000 to 10000` = str_c("$", c("5000", "6000", "7000", "8000"), " to ", c("5999", "6999", "7999", "9999")))) %>%
ggplot(aes(rincome)) +
geom_bar() +
coord_flip()